Oligo-distance: a Sequence Distance Determined by Word Frequencies
Abstract
Differences in the frequencies of chemical words of a given length in two nucleic acid sequences are used to define an “oligo-distance” between the sequences. Oligo-distances are much easier and faster to compute than the distances conventionally determined by sequence alignment. A correlation between oligo-distance and alignment-distance is observed. The two kinds of distances are used to construct phylogenetic trees for artificially generated sequences and for a set of thirty-five 16S and 18S rRNA sequences. The gross topologies of the trees given by the two kinds of distances are identical when the sequences are complete, but only the oligo-distance is robust against sequence deformations such as rearrangement, truncation, and random concatenation.
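The abstract does not give the exact formula, so the following is only a minimal sketch of a word-frequency distance, not the authors' definition. It assumes overlapping words of length k, relative frequencies, and half the summed absolute frequency difference as the distance; the names `word_frequencies` and `oligo_distance` are hypothetical.

```python
from collections import Counter


def word_frequencies(seq, k):
    """Relative frequencies of all overlapping words of length k in seq."""
    n = len(seq) - k + 1  # number of overlapping words
    if n <= 0:
        raise ValueError("sequence is shorter than the word length")
    counts = Counter(seq[i:i + k] for i in range(n))
    return {word: count / n for word, count in counts.items()}


def oligo_distance(a, b, k=3):
    """Assumed form: half the summed absolute difference in word frequencies.

    The value lies between 0 (identical word frequencies) and 1 (no word
    in common). The formula used in the paper may differ.
    """
    fa = word_frequencies(a, k)
    fb = word_frequencies(b, k)
    words = fa.keys() | fb.keys()
    return 0.5 * sum(abs(fa.get(w, 0.0) - fb.get(w, 0.0)) for w in words)


if __name__ == "__main__":
    first = "ACGTTGCAAGCTTAGGCATCGATCGG"
    second = "TTAGCCGATAGGCTAACGTACCGTAA"
    original = first + second
    rearranged = second + first  # swap the two halves
    print(oligo_distance(original, original))
    print(oligo_distance(original, rearranged))
```

Under these assumptions, swapping two blocks of a sequence changes only the k-1 words that span the junction, so the distance between the original and the rearranged sequence stays small. This is consistent with the robustness to rearrangement that the abstract reports, although it does not reproduce the paper's experiments.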
Similar resources
Spaced words and kmacs: fast alignment-free sequence comparison based on inexact word matches
In this article, we present a user-friendly web interface for two alignment-free sequence-comparison methods that we recently developed. Most alignment-free methods rely on exact word matches to estimate pairwise similarities or distances between the input sequences. By contrast, our new algorithms are based on inexact word matches. The first of these approaches uses the relative frequencies of...
Alignment-free distance measure based on return time distribution for sequence analysis: applications to clustering, molecular phylogeny and subtyping.
The data deluge in the post-genomic era demands the development of novel data mining tools. Existing molecular phylogeny analyses (MPAs) developed for individual gene/protein sequences are alignment-based. However, the size of genomic data and the uncertainties associated with alignments necessitate the development of alignment-free methods for MPA. Derivation of distances between sequences is an important st...
The Intellectual Structure of Knowledge in the Field of Distance Education Using the Co-Word analyses
Background: Co-word analysis is one of the content analysis methods used in scientometric studies and in mapping the scientific structure of various fields. The purpose of the present research is to map the structure of distance education using co-word analysis. Methods: The research method is content analysis using co-word analysis. The research population is 31607 documents indexed in the...
Statistical measures of DNA sequence dissimilarity under Markov chain models of base composition.
In molecular biology, the issue of quantifying the similarity between two biological sequences is very important. Past research has shown that word-based search tools are computationally efficient and can find some new functional similarities or dissimilarities invisible to other algorithms like FASTA. Recently, under the independent model of base composition, Wu, Burke, and Davison (1997, Biom...
Phylogenetic Analysis of Some Luffa Genotypes Based on the sequence of intergenic region of trnH-psbA
Luffa (Luffa cylindrica) is a plant from the Cucurbitaceae family that grows mostly in tropical and subtropical regions, as well as in most regions of Iran. In this research, the genetic diversity of nine native and non-native genotypes of L. cylindrica was investigated through the evaluation of the chloroplast trnH-psbA intergenic region (IGS). After sampling the young leaves, DNA extraction w...